Previously, we identified trends in winter temperatures by fitting a linear regression model to weather data. Here, we'll repeat that process, this time focusing on the optimizer. Specifically, we'll work with batch gradient descent and explore how changing the learning rate can alter its behavior.
We'll work with the same linear regression model that we've used in other units. The principles we learn, however, also apply to much more complex models.
Let's load up our weather data from Seattle, keep only the first half of each year (January through June), and make slight adjustments so that the dates are mathematically interpretable.
from datetime import datetime
import pandas
!pip install statsmodels
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/graphing.py
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/seattleWeather_1948-2017.csv
import graphing # Custom graphing code. See our GitHub repository
# Load a file that contains weather data for Seattle
data = pandas.read_csv('seattleWeather_1948-2017.csv', parse_dates=['date'])
# Remove all dates from July 1 onward because we have to plant onions before summer begins
data = data[[d.month < 7 for d in data.date]].copy()
# Convert the dates into numbers so we can use them in our models
# We make a year column that can contain fractions. For example,
# 1948.5 is halfway through the year 1948
data["year"] = [(d.year + d.timetuple().tm_yday / 365.25) for d in data.date]
# Let's take a quick look at our data
print("Visual Check:")
graphing.scatter_2D(data,
                    label_x="year",
                    label_y="min_temperature",
                    title="Temperatures over time (°F)")
Visual Check:
Let's fit a line to this data by using an existing library.
import statsmodels.formula.api as smf
# Perform linear regression to fit a line to our data
# NB OLS uses the sum or mean of squared differences as a cost function,
# which we're familiar with from our last exercise
model = smf.ols(formula = "min_temperature ~ year", data = data).fit()
# Print the model
intercept = model.params["Intercept"]
slope = model.params["year"]
print(f"The model is: y = {slope:0.3f} * X + {intercept:0.3f}")
The model is: y = 0.063 * X + -83.073
Ooh, some math! Don't let that bother you. It's quite common for labels and features to be referred to as Y and X, respectively.
Here:
Y is temperature (°F).
X is year.
So this little equation states that the model estimates temperature by multiplying the year by 0.063 and then subtracting 83.
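As a quick worked example of that equation, the sketch below plugs a year into the fitted line. The slope and intercept are copied from the model printout above; the helper name `estimate_min_temperature` and the example year 2000 are illustrative choices, not part of the exercise code.

```python
# Parameters copied from the model printout above
slope = 0.063
intercept = -83.073

def estimate_min_temperature(year):
    """Hypothetical helper: applies y = slope * X + intercept."""
    return slope * year + intercept

estimate_2000 = estimate_min_temperature(2000)
print(f"Estimated minimum temperature for 2000: {estimate_2000:0.1f} °F")
```

For the year 2000 this works out to 0.063 * 2000 - 83.073, or roughly 43 °F.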
How did the library calculate these values? Let's go through the process.
The first step is always selecting a model. Let's reuse the model that we used in previous exercises.
class MyModel:

    def __init__(self):
        '''
        Creates a new MyModel
        '''
        # Straight lines described by two parameters:
        # The slope is the angle of the line
        self.slope = 0
        # The intercept moves the line up or down
        self.intercept = 0

    def predict(self, date):
        '''
        Estimates the temperature from the date
        '''
        return date * self.slope + self.intercept

    def get_summary(self):
        '''
        Returns a string that summarises the model
        '''
        return f"y = {self.slope} * x + {self.intercept}"
print("Model class ready")
Model class ready
The library fitted the line with ordinary least squares (OLS), which is the standard way to fit a line. OLS uses the mean (or sum) of squared differences as a cost function. (Recall our experimentation with the sum of squared differences in the last exercise.) Let's replicate the preceding line fitting, and break down each step so we can watch it in action.
Recall that for each iteration, our training conducts three steps:
Estimation of Y (temperature) from X (year).
Calculation of the cost function and its slope.
Adjustment of our model according to this slope.
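The three steps can be sketched as a single iteration on a tiny made-up data set. The arrays, the starting slope, and the learning rate below are illustrative assumptions and not the Seattle data; only the slope is updated, mirroring the simplification used in this exercise.

```python
import numpy as np

# Made-up data for illustration: y is exactly 2 * x
x_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_toy = np.array([2.0, 4.0, 6.0, 8.0])

slope_guess = 0.0       # starting guess (assumed)
learning_rate = 0.01    # step size (assumed)

# Step 1: estimate y from x with the current model
y_estimate = x_toy * slope_guess

# Step 2: calculate the cost (mean squared error) and its slope
error = y_estimate - y_toy
cost_before = np.mean(error ** 2)
grad_slope = 2 * np.mean(x_toy * error)

# Step 3: adjust the model by stepping against the gradient
slope_guess -= learning_rate * grad_slope

cost_after = np.mean((x_toy * slope_guess - y_toy) ** 2)
print(f"gradient: {grad_slope}, new slope: {slope_guess}")
print(f"cost before: {cost_before}, cost after: {cost_after}")
```

The gradient is negative here, so subtracting it moves the slope upward, toward the true value of 2, and the cost drops.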
Let's implement this now to watch it in action. Note that to keep things simple, we'll focus on estimating one parameter (line slope) for now.
First, let's look at the cost function for this data. Normally we don't know this in advance, but for learning purposes, let's calculate it now for different potential models.
import numpy as np
x = data.year
temperature_true = data.min_temperature
# We'll use a prebuilt method to show a 3D plot
# This requires a range of x values, a range of y values,
# and a way to calculate z
# Here, we set:
# x to a range of potential model intercepts
# y to a range of potential model slopes
# z as the cost for that combination of model parameters
# Choose a range of intercept and slope values
intercepts = np.linspace(-100,-70,10)
slopes = np.linspace(0.060,0.07,10)
# Set a cost function. This will be the mean of squared differences
def cost_function(temperature_estimate):
    """
    Calculates cost for a given temperature estimate
    Our cost function is the mean of squared differences (a.k.a. mean squared error)
    """
    # Note that with NumPy to square each value, we use **
    return np.mean((temperature_true - temperature_estimate) ** 2)

def predict_and_calc_cost(intercept, slope):
    '''
    Uses the model to make a prediction, then calculates the cost
    '''
    # Predict temperature by using these model parameters
    temperature_estimate = x * slope + intercept
    # Calculate cost
    return cost_function(temperature_estimate)
# Call the graphing method. This will use our cost function,
# which is above. If you want to view this code in detail,
# then see this project's GitHub repository
graphing.surface(x_values=intercepts,
                 y_values=slopes,
                 calc_z=predict_and_calc_cost,
                 title="Cost for Different Model Parameters",
                 axis_title_x="Model intercept",
                 axis_title_y="Model slope",
                 axis_title_z="Cost")
The preceding graph is interactive. Try clicking and dragging the mouse to rotate it.
Notice how cost changes with both intercept and line slope. This is because our model has a slope and an intercept, which both will affect how well the line fits the data. A consequence is that the gradient of the cost function must also be described by two numbers: one for intercept and one for line slope.
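Written out for mean squared error, those two numbers are the partial derivatives below. The symbols are notation introduced here for illustration: b is the intercept, m is the slope, x_i is the year, y_i is the observed temperature, and n is the number of data points.

```latex
J(b, m) = \frac{1}{n} \sum_{i=1}^{n} \left( m x_i + b - y_i \right)^2

\frac{\partial J}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} \left( m x_i + b - y_i \right)
\qquad
\frac{\partial J}{\partial m} = \frac{2}{n} \sum_{i=1}^{n} x_i \left( m x_i + b - y_i \right)
```

Each derivative is twice a mean of the prediction error; the slope's version weights each error by its year.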
Our lowest point on the graph is the location of the best line equation for our data: a slope of 0.063 and an intercept of -83. Let's try to train a model to find this point.
To implement gradient descent, we need a method that can calculate the gradient of the preceding surface.
def calculate_gradient(temperature_estimate):
    """
    This calculates the gradient for a linear regression
    by using the Mean Squared Error cost function
    """
    # The partial derivatives of MSE are as follows
    # You don't need to be able to do this just yet, but
    # it's important to note that these give you the two gradients
    # that we need to train our model
    error = temperature_estimate - temperature_true
    grad_intercept = np.mean(error) * 2
    grad_slope = (x * error).mean() * 2
    return grad_intercept, grad_slope
print("Function is ready!")
Function is ready!
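One way to gain confidence in gradient formulas like these is to compare them against a numerical estimate. The sketch below does that on synthetic data. The data, the parameter values, and the helper names `mse` and `analytic_gradient` are assumptions made for illustration and are not part of the exercise code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumed): a noisy straight line
x_demo = np.linspace(0.0, 10.0, 50)
y_demo = 3.0 * x_demo + 1.0 + rng.normal(0.0, 0.5, size=x_demo.size)

def mse(intercept, slope):
    """Mean squared error of the line on the demo data."""
    return np.mean((x_demo * slope + intercept - y_demo) ** 2)

def analytic_gradient(intercept, slope):
    """Same formulas as calculate_gradient, written for the demo data."""
    error = x_demo * slope + intercept - y_demo
    return 2 * np.mean(error), 2 * np.mean(x_demo * error)

# Central finite differences around an arbitrary point
b0, m0, step = 0.5, 2.0, 1e-6
numeric_intercept = (mse(b0 + step, m0) - mse(b0 - step, m0)) / (2 * step)
numeric_slope = (mse(b0, m0 + step) - mse(b0, m0 - step)) / (2 * step)

grad_b, grad_m = analytic_gradient(b0, m0)
print("intercept gradient:", grad_b, "vs numeric", numeric_intercept)
print("slope gradient:    ", grad_m, "vs numeric", numeric_slope)
```

If the two columns agree to several decimal places, the analytic formulas are very likely correct.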
Now all we need is a starting guess, and a loop that will update this guess with each iteration.
def gradient_descent(learning_rate, number_of_iterations):
    """
    Performs gradient descent for a one-variable function.

    learning_rate: Larger numbers follow the gradient more aggressively
    number_of_iterations: The maximum number of iterations to perform
    """
    # Our starting guess is y = 0 * x - 83
    # We're going to start with the correct intercept so that
    # only the line's slope is estimated. This is just to keep
    # things simple for this exercise
    model = MyModel()
    model.intercept = -83
    model.slope = 0

    for i in range(number_of_iterations):
        # Calculate the predicted values
        predicted_temperature = model.predict(x)

        # === OPTIMIZER ===
        # Calculate the gradient
        _, grad_slope = calculate_gradient(predicted_temperature)
        # Update the estimation of the line
        model.slope -= learning_rate * grad_slope

        # Print the current estimation and cost every 100 iterations
        if i % 100 == 0:
            estimate = model.predict(x)
            cost = cost_function(estimate)
            print("Next estimate:", model.get_summary(), f"Cost: {cost}")

    # Print the final model
    print("Final estimate:", model.get_summary())
# Run gradient descent
gradient_descent(learning_rate=1E-9, number_of_iterations=1000)
Next estimate: y = 0.0004946403321335834 * x + -83 Cost: 15374.06481788891
Next estimate: y = 0.034564263954523104 * x + -83 Cost: 3218.0503324264355
Next estimate: y = 0.050035120236006536 * x + -83 Cost: 711.4491469584556
Next estimate: y = 0.05706036350652576 * x + -83 Cost: 194.5815905316767
Next estimate: y = 0.060250493523378544 * x + -83 Cost: 88.00218235322374
Next estimate: y = 0.061699116600551045 * x + -83 Cost: 66.02523660294689
Next estimate: y = 0.06235692954504888 * x + -83 Cost: 61.493534346710646
Next estimate: y = 0.0626556393176375 * x + -83 Cost: 60.55908578536231
Next estimate: y = 0.06279128202425543 * x + -83 Cost: 60.36640010911301
Next estimate: y = 0.06285287674109104 * x + -83 Cost: 60.326667831309834
Final estimate: y = 0.06288066221361607 * x + -83
Our model arrived very close to the correct slope of about 0.063, but it took many steps. Looking at the printout, we can see how it progressively stepped toward the correct solution.
Now, what happens if we increase the learning rate? This means taking larger steps.
gradient_descent(learning_rate=1E-8, number_of_iterations=200)
Next estimate: y = 0.004946403321335834 * x + -83 Cost: 13267.277888290606
Next estimate: y = 0.06288803098785394 * x + -83 Cost: 60.31736349245315
Final estimate: y = 0.0629041077135948 * x + -83
Our model appears to have found the solution faster. If we increase the rate even more, however, things don't go so well:
gradient_descent(learning_rate=5E-7, number_of_iterations=500)
Next estimate: y = 0.24732016606679166 * x + -83 Cost: 133774.64171441036
Next estimate: y = 9.500952345613598e+45 * x + -83 Cost: 3.549071667291539e+98
Next estimate: y = 4.894806810765144e+92 * x + -83 Cost: 9.420015144175315e+191
Next estimate: y = 2.52176127646553e+139 * x + -83 Cost: 2.500278766819332e+285
Next estimate: y = 1.2991891572707298e+186 * x + -83 Cost: inf
Final estimate: y = -2.2830799448007263e+232 * x + -83
Notice how the cost is getting worse each time.
This is because the steps that the model was taking were too large. Although it would step toward the correct solution, it would step too far and actually get worse with each attempt.
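For this one-parameter case, the overshoot can be made precise. With the intercept fixed, the cost is a quadratic in the slope, and each update multiplies the distance from the best slope by 1 - 2 * learning_rate * mean(x²). The steps shrink only while that factor has a magnitude below 1, which means the learning rate must stay below 1 / mean(x²). The sketch below evaluates this for a synthetic stand-in for the year column. That stand-in is an assumption; the real data should give a similar value because every year lies between 1948 and 2018.

```python
import numpy as np

# Synthetic stand-in for data.year (assumed): evenly spaced fractional years
years = np.linspace(1948.0, 2017.5, 1000)

mean_x_squared = np.mean(years ** 2)
max_stable_rate = 1.0 / mean_x_squared
print(f"Largest stable learning rate is roughly {max_stable_rate:.2e}")

for learning_rate in (1e-9, 1e-8, 5e-7):
    # Factor applied to the distance from the best slope on every step
    factor = 1 - 2 * learning_rate * mean_x_squared
    verdict = "converges" if abs(factor) < 1 else "diverges"
    print(f"learning_rate={learning_rate:g}: factor={factor:+.4f} ({verdict})")
```

This is consistent with the runs above: 1E-9 and 1E-8 converged, while 5E-7 sits above the threshold and diverged.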
For each model, there's an ideal learning rate. Finding it usually requires experimentation.
We've just fit one variable here to keep things simple. Expanding this to fit multiple variables requires only a few small code changes:
We need to update more than one variable in the gradient descent loop.
We need to do some preprocessing of the data, which we alluded to in an earlier exercise. We'll cover how and why in later learning material.
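Here is a minimal sketch of those two changes, on synthetic data rather than the Seattle file. Standardizing the feature (subtracting its mean and dividing by its standard deviation) is one common preprocessing choice and is an assumption here; the later material may use a different approach. All names and values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in data (assumed): temperatures drifting upward over time
years = np.linspace(1948.0, 2017.5, 500)
temps = 0.06 * years - 80.0 + rng.normal(0.0, 2.0, size=years.size)

# Preprocessing: standardize the feature so both gradients have similar scale
x_mean, x_std = years.mean(), years.std()
x_scaled = (years - x_mean) / x_std

intercept, slope = 0.0, 0.0
learning_rate = 0.1

for _ in range(200):
    error = x_scaled * slope + intercept - temps
    grad_intercept = 2 * np.mean(error)
    grad_slope = 2 * np.mean(x_scaled * error)
    # Update both parameters on every iteration
    intercept -= learning_rate * grad_intercept
    slope -= learning_rate * grad_slope

# Convert back to the original units (°F per year, and °F at year 0)
slope_per_year = slope / x_std
intercept_at_zero = intercept - slope_per_year * x_mean
print(f"y = {slope_per_year:0.3f} * x + {intercept_at_zero:0.3f}")

# Compare with NumPy's closed-form least-squares fit
reference_slope, reference_intercept = np.polyfit(years, temps, 1)
print(f"polyfit: y = {reference_slope:0.3f} * x + {reference_intercept:0.3f}")
```

With the feature standardized, a single learning rate works for both parameters. On the raw year values, the slope and intercept gradients differ in scale by a factor of about 2,000, which makes a single workable learning rate much harder to find.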
Well done! In this unit, we:
Watched gradient descent in action.
Saw how changing the learning rate can improve a model's training speed.
Learned that changing the learning rate can also result in unstable models.
You might have noticed that where the cost function stopped and the optimizer began became a little blurred here. Don't let that bother you. This happens commonly, because although the two are conceptually separate, their mathematics can become intertwined.